1 🔍 Introduction

With the growing use of data in all aspects of life, it is extremely important to use information collected from consumers in an ethical manner. While there are multiple schools of ethics to draw on, ethical practice should ensure that consumers are not directly harmed, which means taking deliberate measures to secure such sensitive information.

However, with the global rise in cyber attacks and the unethical practices enabled by wrongful usage of data, it is no longer sufficient to simply secure such data. Steps must be taken to prevent the dissemination of personal information and to de-identify it, so that individuals cannot be traced in the event that the information is released as open data. Through the study of Floridi and Taddeo (2016), we have come to understand that it is not a specific technology (computers, tablets, mobile phones, online platforms, cloud computing and so forth), but what any digital technology manipulates, that represents the correct focus of our ethical strategies. Furthermore, big data is ever more prevalent today as a result of its usefulness. However, based on a study by Kuc-Czarnecka and Olczyk (2020), large datasets coupled with complex analytical algorithms pose the risk of non-transparency, unfairness (e.g., racial or class bias), cherry-picking of data, or even intentional misleading of public opinion, including policymakers, for example by tampering with the electoral process in the context of ‘cyberwars’. For instance, various well-known companies use customer data to manipulate a person's buying behavior by targeting them with specific products.

Hence, before disseminating information to the larger public, it is important to prepare the data in an ethical manner that does not allow for any malicious actions against individuals or corporations who appear in the data. The following are the key considerations and steps taken while preparing the “loan performance open data” by ABC bank.

1.1 🕵 Key considerations before data preparation

The key steps to be taken before the preparation of the dataset are as follows:

  1. It is important to understand the data protection laws prevalent in the geographical region in which the bank is located. As the bank “ABC” is situated in the United States of America, the open data that will be published must comply with the state and federal data protection laws in force. While there is no single data privacy law in the USA, there are largely sector-specific federal and state laws covering data security, secure destruction, Social Security number privacy, online privacy, biometric information privacy, and data breach notification.

  2. It is important to understand the ways in which data can be misused, so as to take the necessary steps to prevent such an event. Below are some of the common ways in which data can be misused.

    • Commingling

    Commingling occurs when corporations or individuals capture data from a particular audience for a specific purpose but use it for a separate task without citing the rightful source. Reusing data submitted for academic research for marketing purposes, or sharing client data between sister organizations without consent, are some of the most common commingling scenarios.

    • Personal benefit

    Personal data may be obtained in order to use it for an organization’s or individual’s personal gain. Such use of data could also carry malicious intent.

    • Ambiguity

    Ambiguity occurs when organizations fail to explicitly disclose how user data is collected and what that data will be used for in a concise and accessible manner.

  3. Once the required data protection laws have been studied in detail, a data ethics checklist can be used to verify the steps taken to create and redistribute the open data on loan performance. We refer to the Deon data ethics checklist.

  4. After going through the data, any redundant variable which does not add any information for the reader can be removed from the dataset. In this dataset, we observe that the variables repch_flag (repurchase flag for the mortgage loan) and aqsn_dte (acquisition date) contain only 0s across the entire column and hence provide no insight for loan performance evaluation. As a result, these variables can be removed from the finalised dataset.

  5. Based on the data ethics checklist, below are some of the important considerations to be taken into account before preparation of the dataset:

    A. Data Collection

    A.1 Informed consent: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
    A.2 Collection bias: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
    A.3 Limit PII exposure: Have we considered ways to minimize exposure of personally identifiable information (PII), for example through anonymization or not collecting information that isn’t relevant for analysis?
    A.4 Downstream bias mitigation: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

    B. Data Storage

    B.1 Data security: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
    B.2 Right to be forgotten: Do we have a mechanism through which an individual can request their personal information be removed?
    B.3 Data retention plan: Is there a schedule or plan to delete the data after it is no longer needed?
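As a sanity check for step 4 above, constant columns can be flagged programmatically rather than spotted by eye. A minimal sketch using dplyr on toy data (the demo table and its values are illustrative, not the bank's actual schema):

```r
library(dplyr)

# Toy stand-in for the raw loan table (values are illustrative)
demo <- data.frame(
  repch_flag = c(0, 0, 0),       # constant column: carries no information
  aqsn_dte   = c(0, 0, 0),       # constant column: carries no information
  orig_rt    = c(3.1, 4.2, 5.0)  # varies: keep
)

# Flag every column with a single distinct value
constant_cols <- names(demo)[sapply(demo, dplyr::n_distinct) == 1]

demo_clean <- demo %>% select(-all_of(constant_cols))
names(demo_clean)
#> [1] "orig_rt"
```

The same check applied to the real table would surface repch_flag and aqsn_dte automatically.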

2 🔧 Methodology

2.1 🕹️ De-Identification Strategies

The following are the primary de-identification strategies applied to the variables in the raw dataset:

  1. Selection and removal of variables
  2. Aggregation
  3. Top and bottom coding
  4. Perturbation
  5. Specific date omission
  6. Cell suppression
  7. Value suppression

2.2 ⚠️ Removal of direct identifiers

Direct identifiers are variables which can pinpoint an individual in a dataset. Often, these variables consist of personal information which could be used with malicious intent if not removed before the data is released to the public. The following variables act as direct identifiers and are removed from the raw dataset.

  • First name ( first_name )

  • Last name ( last_name )

library(tidyverse) # readr::read_rds and the dplyr verbs used throughout

# Read the raw loan data and drop the direct identifiers
loan_data_raw <- read_rds(here::here('raw_data/loanData.rds'))
loan_data_filtered_directid <- loan_data_raw %>% select(-c('last_name','first_name'))

2.3 🛣 Handling quasi-identifiers

Quasi-identifiers are pieces of information which cannot directly identify an individual on their own, but which can be combined to re-identify someone with reasonable confidence.
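The risk posed by a set of quasi-identifiers can be quantified by counting how many records share each combination of values (the k-anonymity idea): any combination observed only once is re-identifiable. A minimal sketch on toy data (the columns and values are illustrative):

```r
library(dplyr)

# Toy records with two quasi-identifiers (illustrative values only)
demo <- data.frame(
  age_group = c("30-34", "30-34", "35-39", "35-39", "40-44"),
  state     = c("CA",    "CA",    "NY",    "NY",    "TX")
)

# Count how many records share each combination of quasi-identifiers;
# the smallest group size is the dataset's k-anonymity level
group_sizes <- demo %>% count(age_group, state)
k <- min(group_sizes$n)
k
#> [1] 1
```

Here k = 1 because the single 40-44/TX record is unique; aggregation and suppression, as applied in the following subsections, are ways of raising k.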

Various techniques will be used for suppressing sensitive information which will be delineated as follows:

2.3.1 Removal of variables

Certain quasi-identifiers will be removed from the final dataset, as these variables contribute little to assessing loan performance. They consist of the unique loan id and geographical data. The following variables will be dropped to reduce the risk of re-identification:

  • Loan identifier ( loan_id )
  • Seller name ( seller )
  • Property state ( state )
  • Zip code short ( zip_3 )
  • Servicer name ( servicer )
  • Metropolitan Statistical Area ( msa )
loan_data_filtered_quasi <- loan_data_filtered_directid %>% select(-c('loan_id','seller','state','zip_3','servicer','msa'))

2.3.2 Aggregation and top or bottom coding

The aggregation method may be utilised to group data into bins. This prevents identification of a person based on unique values observed within the dataset.
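The binning in the chunks below is done with nested case_when() conditions; the same aggregation can be sketched more compactly with base R's cut() (the toy ages are illustrative):

```r
# Bin ages into right-open intervals, collapsing everything above 44
# into a single top-coded bin
ages <- c(31, 34, 38, 41, 47, 52)
age_bins <- cut(ages,
                breaks = c(30, 35, 40, 45, Inf),
                labels = c("30-34", "35-39", "40-44", "> 44 years"),
                right = FALSE)
as.vector(table(age_bins))
#> [1] 2 1 1 2
```

cut() guarantees the bins are exhaustive and non-overlapping, which is easy to get wrong with hand-written inequalities.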

2.3.2.1 Customer Age

options(scipen = 999) #To remove scientific notation on the graph
pl1 <- ggplot(data=loan_data_filtered_quasi,aes(x=cus_age)) + geom_histogram(color='red') + ggtitle("Customer age distribution") +
  labs(x='Customer age',y='Number of people')
print(pl1)

Figure 2.1: Distribution of customer age

As we can observe from figure 2.1, the customer ages are roughly uniformly distributed. Although there are no outliers, the ages will be aggregated into age groups to aid further de-identification. In addition, ages higher than 44 years are top coded to “> 44 years”, preventing identification of the small groups at the upper end of the distribution.

loan_data_age_grouped <- loan_data_filtered_quasi %>% 
  mutate(cus_age_group = case_when((cus_age >=30 & cus_age<35) ~ "30-34",
                                   (cus_age >=35 & cus_age<40) ~ "35-39",
                                   (cus_age >=40 & cus_age<45) ~ "40-44",
                                   (cus_age >=45) ~ "> 44 years")) 

age_groups <- c("30-34","35-39","40-44","> 44 years")

pl2 <- ggplot(loan_data_age_grouped %>% group_by(cus_age_group) %>% 
                summarise(count = n()),
              aes(x=factor(cus_age_group,age_groups,age_groups), 
                  y=count,fill=cus_age_group)) + 
  geom_col(color='black') + theme_classic() +
  geom_text(aes(label=count),vjust=1.5) + theme(legend.position = 'none',axis.text.x = element_text(face='bold')) + ggtitle('Aggregated age') +
  labs(x="Age groups",y='Number of credit borrowers') 
print(pl2)

Figure 2.2: Count of individuals for each age group

Figure 2.2 depicts the count of individuals after aggregating the customer ages into age groups and top coding individuals aged 45 years and above as “> 44 years”.

2.3.2.2 Number of dependents

The number of dependents may not directly provide information on the loan performance of a bank, but it can provide important information regarding a borrower’s ability to repay loans without defaulting. As this data may be prone to malicious misuse, it is aggregated into three categories: no dependents (“None”), one dependent (“Single”), and more than one dependent (“Multiple”).

loan_data_new <- loan_data_age_grouped %>% 
  mutate(have_dependents = case_when( no_depend == 0 ~ "None",
                                      no_depend == 1 ~ "Single",
                                      no_depend > 1 ~ "Multiple")) 

pl3 <- ggplot(data = loan_data_new %>% 
                count(have_dependents),
              aes(x=have_dependents,y=n,fill=have_dependents)) + geom_col(color='black') + theme_classic() + coord_flip() + labs(x='Any dependents for the borrower ?',y='Count of borrowers') + theme(legend.position = 'none') 
print(pl3) 

Figure 2.3: Indicator for dependents of borrower

2.3.2.3 Original Interest Rate

Original interest rates are generally well known to customers, and hence an individual can easily re-identify themselves in a dataset. There are a total of 489 unique values in the dataset.

pl13 <- ggplot(data= loan_data_new,aes(x=orig_rt)) + geom_histogram(alpha=0.7,color='black',fill='red') + theme_classic() +
  labs(x='Original interest rate (in %)', y="Number of individuals") + ggtitle("Distribution of original interest rate") + theme(legend.position = 'none',plot.title = element_text(hjust=0.5))
print(pl13)

Figure 2.4: Distribution of the original interest rates

As we can observe from figure 2.4, the majority of the interest rates lie between 2.5% and 6.5%. One way to de-identify the loan interest rates is to aggregate them into the bins shown below. Figure 2.5 depicts the number of borrowers in each aggregated interest rate bin.

  • 0 - 2.5 %
  • 2.5 - 3.5 %
  • 3.5 - 4.5 %
  • 4.5 - 5.5 %
  • 5.5 - 6.5 %
  • > 6.5 %
loan_data_new <- loan_data_new %>% mutate(orig_rt_grouped = case_when((orig_rt <= 2.5) ~ "0-2.5",
                                   (orig_rt > 2.5 & orig_rt <= 3.5) ~ "2.5-3.5",
                                   (orig_rt > 3.5 & orig_rt <= 4.5) ~ "3.5-4.5",
                                   (orig_rt > 4.5 & orig_rt <= 5.5) ~ "4.5-5.5",
                                   (orig_rt > 5.5 & orig_rt <= 6.5) ~ "5.5-6.5",
                                   (orig_rt > 6.5 ) ~ "> 6.5"))

loan_breaks <- c("0-2.5","2.5-3.5","3.5-4.5","4.5-5.5","5.5-6.5","> 6.5")


pl14 <- ggplot(data= loan_data_new %>% count(orig_rt_grouped), aes(x=factor(orig_rt_grouped,loan_breaks,loan_breaks),y=n,fill=orig_rt_grouped)) + theme_classic() +
  geom_text(aes(label=n),vjust=-0.7,size=3) + 
  geom_col(color='black',alpha=0.7) + labs(x='Original Interest Rate (in %)',y='Number of borrowers') + labs(fill='Interest rate') + ggtitle("Original interest rates after aggregation") + theme(plot.title = element_text())

print(pl14)

Figure 2.5: Original interest rates after aggregation

2.3.2.4 Last loan rate

The last loan rate is a value that can be easily identified by a person similar to what was discussed in section 2.3.2.3. Hence, a similar aggregation strategy will be implemented to aid the de-identification process.

pl18 <- ggplot(data= loan_data_new,aes(x=last_rt)) + geom_histogram(alpha=0.7,color='black',fill='#6699ff') + theme_classic() +
  labs(x='Last interest rate (in %)', y="Number of individuals") + ggtitle("Distribution of last interest rate") +theme(legend.position = 'none',plot.title = element_text(hjust=0.5))

print(pl18)

Figure 2.6: Distribution of the last interest rates

Figure 2.7 depicts the number of borrowers based on the various aggregated interest rates.

  • 0 - 2.5 %
  • 2.5 - 3.5 %
  • 3.5 - 4.5 %
  • 4.5 - 5.5 %
  • 5.5 - 6.5 %
  • > 6.5 %
loan_data_new <- loan_data_new %>% mutate(last_rt_grouped = case_when((last_rt <= 2.5) ~ "0-2.5",
                                   (last_rt > 2.5 & last_rt <= 3.5) ~ "2.5-3.5",
                                   (last_rt > 3.5 & last_rt <= 4.5) ~ "3.5-4.5",
                                   (last_rt > 4.5 & last_rt <= 5.5) ~ "4.5-5.5",
                                   (last_rt > 5.5 & last_rt <= 6.5) ~ "5.5-6.5",
                                   (last_rt > 6.5 ) ~ "> 6.5"))

loan_breaks <- c("0-2.5","2.5-3.5","3.5-4.5","4.5-5.5","5.5-6.5","> 6.5")


pl19 <- ggplot(data= loan_data_new %>% count(last_rt_grouped), aes(x=factor(last_rt_grouped,loan_breaks,loan_breaks),y=n,fill=last_rt_grouped)) + theme_classic() +
  geom_text(aes(label=n),vjust=-0.7,size=3) + 
  geom_col(color='black',alpha=0.7) + labs(x='Last Interest Rate (in %)',y='Number of borrowers') + labs(fill='Interest rate') + ggtitle("Last interest rates after aggregation") + theme(plot.title = element_text()) + scale_fill_brewer(palette = 'Paired')

print(pl19)

Figure 2.7: Last interest rates after aggregation

2.3.2.5 Credit score of borrower

The credit industry in the US utilises the FICO score (C-score) to assess the ability of a borrower to repay credited money. Based on the credit score, a bank or other financial institution may decide to extend credit to a borrower. According to Forbes, the credit score of a borrower may be divided into specific categories. These categories can be created through data aggregation, which provides the added benefit of de-identifying data containing unique C-score values. Table 2.1 lists the aggregations that will be performed for the credit risk scores of borrower and co-borrower.

Scores <- c("0-580","580-670","670-740","740-800","800-850")
Ratings <- c("Poor","Fair","Good","Very Good","Excellent")

df_ratings <- data.frame(Scores,Ratings)

df_ratings %>% kable(caption = 'Credit score ratings',booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("bordered","hover")) %>%
  row_spec(0,background="rgb(172,175,145)",color='black',font_size = 18)
Table 2.1: Credit score ratings
Scores Ratings
0-580 Poor
580-670 Fair
670-740 Good
740-800 Very Good
800-850 Excellent
loan_data_new <- loan_data_new %>% 
  mutate(cscore_b_rate = case_when( cscore_b > 0 & cscore_b <580 ~ "Poor",
                               cscore_b >= 580 & cscore_b <670 ~ "Fair",
                               cscore_b >=670  & cscore_b <740 ~ "Good",
                               cscore_b >= 740 & cscore_b < 800 ~ "Very Good",
                               cscore_b >= 800 ~ "Excellent"
                               )) 



pl6 <- ggplot(data= loan_data_new %>% 
                count(cscore_b_rate), aes(x = "", y=n, fill = cscore_b_rate)) + 
  geom_col(color='black') + 
  theme_classic() +
  coord_polar(theta='y') + 
  ggtitle("Aggregated borrower credit score") +
  geom_label_repel(aes(label = n),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE)  +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), axis.text = element_blank()) + labs(fill = 'Borrower credit score rating')
print(pl6)

Figure 2.8: Aggregated Credit Score of Borrower

As we can observe from the pie chart in figure 2.8, there is only one individual with a borrower credit score of 0-580 (the “Poor” category). Since this person could easily re-identify themselves, we exercise value suppression by changing this borrower’s credit score rating to a null value (NA), which prevents any possible re-identification as depicted by figure 2.9.

loan_data_new <- loan_data_new %>% mutate(cscore_b_rate =  ifelse(cscore_b_rate=="Poor",NA,cscore_b_rate))



pl7 <- ggplot(data= loan_data_new %>% 
                count(cscore_b_rate), aes(x = "", y=n, fill = cscore_b_rate)) + 
  geom_col(color='black') + 
  theme_classic() +
  coord_polar(theta='y') + 
  ggtitle("Aggregated borrower credit score after value suppression") +
  geom_label_repel(aes(label = n),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE)  +
  theme(axis.title.x = element_blank(), 
        axis.title.y = element_blank(), 
        axis.text = element_blank()) + labs(fill = 'Borrower credit score rating')
print(pl7)

Figure 2.9: Aggregated borrower’s credit score with value suppression

Figure 2.10 depicts a pie chart of the spread of the credit score for co-borrowers, where applicable. As there is a fair number of individuals in each category, it can be reasonably assumed that no single individual will be able to re-identify themselves through this aggregation.

loan_data_new <- loan_data_new %>% 
  mutate(cscore_c_rate = case_when( cscore_c > 0 & cscore_c <580 ~ "Poor",
                               cscore_c >= 580 & cscore_c <670 ~ "Fair",
                               cscore_c >=670  & cscore_c <740 ~ "Good",
                               cscore_c >= 740 & cscore_c < 800 ~ "Very Good",
                               cscore_c >= 800 ~ "Excellent"
                               )) 


pl8 <- ggplot(data= loan_data_new %>% 
                count(cscore_c_rate), aes(x = "", y=n, fill = cscore_c_rate)) + 
  geom_col(color='black') + 
  theme_classic() +
  coord_polar(theta='y') + 
  ggtitle("Aggregated co-borrower credit score") +
  geom_label_repel(aes(label = n),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE)  +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), axis.text = element_blank()) + labs(fill = "Co-borrower's credit score rating")
print(pl8)

Figure 2.10: Aggregated co-borrower’s credit score
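The judgement that each category holds a “fair number” of individuals can be made explicit with a minimum cell-size rule: any category whose count falls below a chosen threshold becomes a candidate for value suppression, as was done for the single “Poor” borrower above. A sketch on toy counts (the threshold of 5 and the counts are assumptions for illustration, not the bank's policy or actual figures):

```r
library(dplyr)

# Toy category counts (illustrative, not the actual co-borrower figures)
demo <- data.frame(
  rating = c("Poor", "Fair", "Good", "Very Good", "Excellent"),
  n      = c(1,      120,    340,    280,         90)
)

threshold <- 5  # assumed minimum acceptable cell size
to_suppress <- demo %>% filter(n < threshold) %>% pull(rating)
to_suppress
#> [1] "Poor"
```

Running such a rule over every categorical variable before release catches small cells that a visual inspection might miss.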

2.3.3 Number of borrowers

As the number of borrowers on a loan is well known to a customer, the outliers must be masked using a top coding technique.

orig_borrowers <- loan_data_new %>% count(num_bo)


pl16 <- ggplot(data=orig_borrowers, aes(x=num_bo,y=n,fill=as.factor(num_bo))) + geom_text(aes(label=n),vjust=-0.7,size=3) + 
  geom_col(alpha=0.7,color='black') + theme_classic() + scale_x_continuous(labels= 1:7,breaks = 1:7) + labs(x="Number of borrowers",y="Count",fill='Number of \n borrowers registering for a loan') +  
  ggtitle("Before top encoding") + 
  theme(legend.title = element_text(size = 6),plot.title=element_text(hjust=0.5))
print(pl16)

Figure 2.11: Number of borrowers for a loan before top encoding

As we can observe from figure 2.11, there are only 5 cases with 6 borrowers and 1 case with 7 borrowers. These individuals may be able to re-identify themselves due to their low counts. Hence, a top coding strategy is implemented to mask all the outliers and aid the de-identification process, as shown by figure 2.12.

loan_data_new <- loan_data_new %>% mutate(num_bo_new = ifelse(num_bo >=5,"> 4",num_bo))

borrowers_new <- c("1","2","3","4","> 4")

pl17 <- ggplot(data= loan_data_new %>% count(num_bo_new), 
               aes(x=factor(num_bo_new,borrowers_new,borrowers_new),
                   y=n,fill=as.factor(num_bo_new))) + geom_col(color='black') + 
  geom_text(aes(label=n),vjust=-0.7,size=3) + 
  theme_classic() +
  labs(x="Number of borrowers", y="Count") + ggtitle("After top encoding") +
  theme(legend.position = 'none',plot.title = element_text(hjust=0.5),axis.text.x = element_text(face='bold')) 

                                                                                                                    
print(pl17)

Figure 2.12: Number of borrowers after top encoding

2.3.3.1 Debt to income ratio

Debt and income are values that are known to an individual borrower, who could therefore use this data to re-identify themselves in a dataset. Currently, there are 3 unique values in the dataset, which will be de-identified using a top coding strategy. Table 2.2 depicts the top coded term “> 50” for the debt to income ratio, which prevents any form of re-identification.

loan_data_new <- loan_data_new %>% mutate(dti_new = ifelse(dti > 50,"> 50",dti))

dti_out <- loan_data_new %>% count(dti_new) %>% arrange(n) %>% head(5)
dti_out <- dti_out %>% rename("Debt to income ratio"="dti_new","Count"="n")

dti_out %>% kable(caption = 'Debt to income ratio with lowest counts after top coding',booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("bordered","hover")) %>%
  row_spec(0,background="rgb(172,175,145)",color='black',font_size = 18)
Table 2.2: Debt to income ratio with lowest counts after top coding
Debt to income ratio Count
> 50 24
1 270
2 299
3 421
4 601

2.3.3.2 Unpaid balance at the time of removal

The unpaid balance (UPB) at the end of a loan term can be computed by a borrower to re-identify themselves in a dataset. An aggregation strategy may be utilised to implement the required de-identification. Figure 2.13 depicts the distribution of the unpaid balance data along with the aggregated factors.

The following factors are applied to this data, and the count of borrowers at each level is depicted by figure 2.14:

  • $0-1,000: No or low UPB
  • $1,000-50,000: Level 1 (L1)
  • $50,000-100,000: Level 2 (L2)
  • $100,000-250,000: Level 3 (L3)
  • $250,000-518,000: Level 4 (L4)
  • Greater than $518,000: Level 5 (L5)
outlier_upb <- IQR(loan_data_new$last_upb,na.rm=TRUE) * 1.5 + quantile(loan_data_new$last_upb,0.75,na.rm=TRUE)

pl20 <- ggplot(data=loan_data_new,aes(x=last_upb)) + geom_histogram(color='black',fill='#ffdb4d',alpha=0.7) + 
  theme_classic() + labs(x="Last unpaid balance (in $)",y="Number of borrowers") + 
  ggtitle("Distribution of UPB before aggregation") + 
  theme(plot.title = element_text(hjust=0.5)) +
  annotate("segment", x = 1, xend = 1, y = 0, yend = 750000,colour = "red",alpha=0.6,linetype=3) +
  annotate("segment", x = outlier_upb  , xend = outlier_upb, y = 0, yend = 750000,colour = "red",alpha=0.6,linetype=3) +
  annotate("segment", x = 50000  , xend = 50000, y = 0, yend = 750000,colour = "red",alpha=0.6,linetype=3) +
  annotate("segment", x = 100000  , xend = 100000, y = 0, yend = 750000,colour = "red",alpha=0.6,linetype=3) +
  annotate("segment", x = 250000  , xend = 250000, y = 0, yend = 750000,colour = "red",linetype=3) +
  annotate("text", x = 0, y= 800000, colour = "darkgreen",label='Low/Nil UPB',size = 3) +
  annotate("text", x = 25000, y= 600000, colour = "black",label='L1',size = 3) +
  annotate("text", x = 80000, y= 700000, colour = "black",label='L2',size = 3) +
  annotate("text", x = 200000, y= 700000, colour = "#ff884d",label='L3',size = 3) +
  annotate("text", x = 400000, y= 700000, colour = "#ff661a",label='L4',size = 3) +
  annotate("text", x = 1000000, y= 700000, colour = "red",label='L5 (Outliers masked in this level)',size = 4)
  
print(pl20)

Figure 2.13: Distribution of UPB before aggregation

loan_data_new <- loan_data_new %>% mutate(last_upb_grouped= case_when((last_upb <= 1000) ~ "Low/No UPB",
                                   (last_upb > 1000 & last_upb <= 50000 ) ~ "L1",
                                   (last_upb > 50000 & last_upb <= 100000) ~ "L2",
                                   (last_upb > 100000 & last_upb <= 250000) ~ "L3",
                                   (last_upb > 250000 & last_upb <= 518000) ~ "L4",
                                   (last_upb > 518000) ~ "L5"))

upb_levels <- c("Low/No UPB","L1","L2","L3","L4","L5","NA")

pl21 <- ggplot(data=loan_data_new %>% count(last_upb_grouped),
               aes(x=factor(last_upb_grouped,upb_levels,upb_levels),
                   y=n,
                   fill =last_upb_grouped)) + 
  geom_text(aes(label=n),nudge_y = 50000) + theme_classic() +
  geom_col(color='black') + scale_fill_brewer(palette = "Dark2") +
  labs(x="Levels for Unpaid Balance (UPB)",y="Number of borrowers") + 
  ggtitle("Distribution of UPB after aggregation") +
  theme(legend.position = 'none',axis.text.x = element_text(face='bold'), 
        axis.text.y = element_text(face='bold'),
        plot.title = element_text(hjust=0.5)) 
  

print(pl21)

Figure 2.14: Aggregation performed to unpaid balance

2.3.4 🛠️ Data Perturbation and Cell Suppression

Perturbation and suppression techniques can help de-identify unique values in the dataset. However, the noise added to the dataset must be reasonable and should not drastically skew the dataset.
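Whether added noise is “reasonable” can itself be checked by comparing summary statistics before and after perturbation; a minimal sketch with simulated incomes (the distribution parameters and the 1% tolerance are purely illustrative assumptions):

```r
set.seed(42)  # reproducible noise

income    <- rlnorm(1000, meanlog = 11, sdlog = 0.5)    # simulated incomes
perturbed <- income + rnorm(1000, mean = 0, sd = 250)   # small additive noise

# The median should barely move relative to the scale of the data
relative_shift <- abs(median(perturbed) - median(income)) / median(income)
relative_shift < 0.01
#> [1] TRUE
```

A comparison of quantiles or a side-by-side boxplot, as done for the income variable below, serves the same purpose visually.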

2.3.4.1 Income of main borrower

Since the incomes of borrowers are well known to them, these values can be masked by perturbing the data with added noise. For outliers, we can apply cell suppression to replace the highest income values with null values. Figure 2.15 depicts boxplots indicating the distribution of incomes for the various age groups before the addition of noise through perturbation and the consequent cell suppression.

Upon going through the income data, it can be observed that some particular income values are unique and easily re-identified. Hence, a reasonable amount of noise is applied to de-identify the data without drastically changing the distribution across the various age groups.

The noise addition was performed as a function of the overall magnitude of the income. The following strategy was applied while adding noise to the dataset.

  • All noise additions were performed using a normal distribution.
  • For income values from the 0th percentile (20000 $) to the 25th percentile (255296 $), the noise was drawn from a normal distribution with mean 10 $ and standard deviation 250 $.
  • For income values between the 25th and 75th percentile (755076 $), the noise was drawn from a normal distribution with mean 50 $ and standard deviation 500 $.
  • For income values between the 75th and 99th percentile (990189 $), the noise was drawn from a normal distribution with mean 500 $ and standard deviation 1000 $.
  • For income values between the 99th and 100th percentile (100000 $), the incomes were replaced by null (NA) values, as these represent the top 1 percent of the earning population and could be re-identified due to the low population in this sub-category. Hence, the cell suppression technique was applied.

As observed through figure 2.16, the boxplots for various age groups after noise addition and cell suppression indicate a similar distribution to the original incomes for the various age groups in the dataset.

pl4<- ggplot(data = loan_data_new, aes(y= income)) + geom_boxplot(color='red') + labs(y='Income of main borrower ($)') + facet_grid(~factor(cus_age_group,levels = age_groups)) + ggtitle("Income distribution for various age groups before added noise and cell suppression") +
  theme(axis.text.x = element_blank())
print(pl4)

Figure 2.15: Original income distribution

loan_data_new <- loan_data_new %>% 
  mutate(new_income = case_when( 
  (income < quantile(income,0.25)) ~ (income + rnorm(n(),10,250)),
  
  (income >= quantile(income,0.25)) & (income < quantile(income,0.75)) ~ (income + rnorm(n(),50,500)) ,
  
  (income >= quantile(income,0.75)) & (income < quantile(income,0.99)) ~ (income + rnorm(n(),500,1000)))) #Perturbation and cell suppression

pl5<- ggplot(data = loan_data_new, aes(y= new_income)) + geom_boxplot(color='red') + labs(y='Income of main borrower ($)') +  facet_grid(~factor(cus_age_group,levels = age_groups)) + ggtitle("Income distribution after noise and suppression for various age groups") + theme(axis.text.x=element_blank())
print(pl5)

Figure 2.16: Income distribution after perturbation and suppression

2.3.4.2 First paid installment date

The first paid installment date is known to the customer, as it is well accounted for in the payment summary from their bank. Hence, this data can be easily compared with past records for re-identification. The raw dataset contains a total of 4 dates that occur only once, as observed in table 2.3. In this approach, we add random noise to the data in the form of months, which aids the de-identification process. Table 2.4 lists the original unique first payment dates and the new dates created to replace them after introducing random noise.

df_unique <- loan_data_new %>% count(frst_dte) %>% filter(n==1)

unique_dates <- as.Date(df_unique$frst_dte)

df_unique <-data.frame(unique_dates) %>% rename("Unique First Payment Dates"="unique_dates")

df_unique %>% kable(caption = 'Unique first payment date',booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("bordered","hover")) %>%
  row_spec(0,background="rgb(172,175,145)",color='black',font_size = 18)
Table 2.3: Unique first payment date
Unique First Payment Dates
2013-01-01
2013-09-01
2014-12-01
2015-02-01
loan_data_new$frst_dte <- as.Date(loan_data_new$frst_dte) 


loan_data_new <- loan_data_new %>% mutate(frst_dte_new = ifelse(frst_dte %in% unique_dates, frst_dte + months(floor(runif(1, min=2, max=3)))  ,frst_dte))

loan_data_new$frst_dte_new<- as.Date(loan_data_new$frst_dte_new, origin="1970-1-1")


df_new_dates <- loan_data_new %>% filter(frst_dte %in% unique_dates) %>% select(c(frst_dte,frst_dte_new))

df_new_dates <- df_new_dates %>% rename("Original First Payment Date"="frst_dte","New First Payment Date"="frst_dte_new")


df_new_dates %>% kable(caption = 'Original and new first payment dates after noise addition',booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("bordered","hover")) %>%
  row_spec(0,background="rgb(172,175,145)",color='black',font_size = 18)
Table 2.4: Original and new first payment dates after noise addition
Original First Payment Date New First Payment Date
2015-02-01 2015-04-01
2013-09-01 2013-11-01
2014-12-01 2015-02-01
2013-01-01 2013-03-01

2.3.4.3 Last paid installment date

The last paid installment date is well known to the customer, as it is well accounted for in the payment summary from their bank. Hence, this data can be easily compared with past records for re-identification. The raw dataset contains a total of 5 dates that occur only once, as observed in table 2.5. In this approach, we add random noise to the data in the form of months, which aids the de-identification process.

# Identify last payment dates that occur exactly once in the dataset
df_unique <- loan_data_new %>% count(lpi_dte) %>% filter(n == 1)

unique_dates <- as.Date(df_unique$lpi_dte)

df_unique <- data.frame(unique_dates) %>% rename("Unique Last Payment Dates" = "unique_dates")

df_unique %>% kable(caption = 'Last payment date',booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("bordered","hover")) %>%
  row_spec(0,background="rgb(172,175,145)",color='black',font_size = 18)
Table 2.5: Last payment date
| Unique Last Payment Dates |
|---------------------------|
| 2017-01-01 |
| 2017-03-01 |
| 2020-11-01 |
| 2021-02-01 |
| 2021-03-01 |
loan_data_new$lpi_dte <- as.Date(loan_data_new$lpi_dte) 

# Draw a single random offset of one or two months and add it to
# every unique last payment date
loan_data_new <- loan_data_new %>% 
  mutate(lpi_dte_new = ifelse(lpi_dte %in% unique_dates,
                              lpi_dte + months(sample(1:2, 1)),
                              lpi_dte))

# ifelse() drops the Date class, so restore it
loan_data_new$lpi_dte_new <- as.Date(loan_data_new$lpi_dte_new, origin = "1970-01-01")


df_new_dates <- loan_data_new %>% filter(lpi_dte %in% unique_dates) %>% select(c(lpi_dte,lpi_dte_new))

df_new_dates <- df_new_dates %>% rename("Original Last Payment Date"="lpi_dte","New Last Payment Date"="lpi_dte_new")


df_new_dates %>% kable(caption = 'Original and new last payment dates after noise addition',booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("bordered","hover")) %>%
  row_spec(0,background="rgb(172,175,145)",color='black',font_size = 18)
Table 2.6: Original and new last payment dates after noise addition
| Original Last Payment Date | New Last Payment Date |
|----------------------------|-----------------------|
| 2017-03-01 | 2017-05-01 |
| 2017-01-01 | 2017-03-01 |
| 2020-11-01 | 2021-01-01 |
| 2021-02-01 | 2021-04-01 |
| 2021-03-01 | 2021-05-01 |

As we can observe from table 2.6, a random offset of one to two months (here, two) was added to the original dates. This does not drastically alter the original event date, yet it successfully de-identifies the last payment date of the borrower.
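As a side note, lubridate's `%m+%` operator offers a more compact way to apply this kind of month-level noise while keeping the `Date` class intact, avoiding the `as.Date(..., origin = ...)` round trip after `ifelse()`. A minimal sketch on toy data (not the report's dataset):

```r
library(dplyr)
library(lubridate)

# Toy data: the last date occurs only once and is therefore perturbed
dates     <- as.Date(c("2017-01-01", "2017-01-01", "2020-11-01"))
is_unique <- !(duplicated(dates) | duplicated(dates, fromLast = TRUE))

# One random offset of 1-2 months per row; %m+% adds whole months
# and returns a Date, so no class restoration is needed
noise     <- months(sample(1:2, length(dates), replace = TRUE))
perturbed <- if_else(is_unique, dates %m+% noise, dates)
```

Unlike base `ifelse()`, `dplyr::if_else()` also preserves the `Date` type of its inputs.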

2.3.4.4 Loan age

Loan age is the number of calendar months since the origination of the mortgage loan. As this value can easily be recorded by a borrower, the unique values need to be de-identified. Currently, there are 3 unique values in the dataset, to which random noise of one to two months will be added.

# Identify loan ages (in months) that occur exactly once in the dataset
unique_loan_age <- loan_data_new %>% count(loan_age) %>% filter(n == 1) %>% pull(loan_age)

# Add a single random offset of one or two months to the unique loan ages
loan_data_new <- loan_data_new %>% 
  mutate(loan_age_new = ifelse(loan_age %in% unique_loan_age,
                               sample(1:2, 1) + loan_age,
                               loan_age))

loan_age_table <- data.frame(loan_data_new %>% filter(loan_age %in% unique_loan_age)) %>% select(c(loan_age,loan_age_new))
loan_age_table <- loan_age_table %>% rename("Unique original loan age"="loan_age",
                                            "New loan age"="loan_age_new")

loan_age_table %>% kable(caption = 'Loan age data before and after perturbation ',booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("bordered","hover")) %>%
  row_spec(0,background="rgb(172,175,145)",color='black',font_size = 18)
Table 2.7: Loan age data before and after perturbation
| Unique original loan age | New loan age |
|--------------------------|--------------|
| 76 | 78 |
| 97 | 99 |
| 82 | 84 |

As we can see from table 2.7, random noise has been introduced into the unique loan age values to aid de-identification. The noise is kept within a reasonable limit so that it causes no material change to the distribution of the loan age data, as can be observed from figure 2.17.

loan_age_long <-loan_data_new %>% pivot_longer(cols = c(loan_age,loan_age_new),
                                                names_to = 'loan_age_version',
                                                values_to = 'loan_age_value')
pl15 <- ggplot(data=loan_age_long,aes(y=loan_age_value,fill=loan_age_version)) + 
  geom_boxplot() + facet_wrap(~loan_age_version) +
  theme_classic() + labs(y='Loan age value',fill= 'Loan age') + 
  ggtitle('Distribution of loan age before and after noise') + 
  theme(plot.title = element_text(hjust=0.5),axis.text.x = element_blank(),axis.ticks.x = element_blank())
                                                                                                                  
print(pl15)

Figure 2.17: Distribution of loan age before and after noise

2.3.5 🪚 Omission of specific date

2.3.5.1 Loan first modification date

Upon looking into the loan first modification date, we can observe that although the data contains only the month and the year, an individual may still be uniquely identified in 2017, as there is only one observation in the month of October.

loan_data_new$fmod_dte <- ymd(loan_data_new$fmod_dte)

mod_17 <- loan_data_new %>% filter(year(fmod_dte)==2017)

borrowers <- 0:10
pl9 <- ggplot(data = mod_17, aes(x=fmod_dte)) + geom_bar(fill='blue',color='black') + 
  labs(y='Number of borrowers',x='Month of loan modification date') + theme_classic() +
  ggtitle("Loan modification month in 2017") + theme(plot.title = element_text(hjust=0.5)) +
  scale_y_continuous(labels = as.character(borrowers),breaks = borrowers)
print(pl9)

Figure 2.18: Loan modification month in 2017

As we can observe from figure 2.18, the individual whose loan was modified in October 2017 could re-identify themselves. We can use the technique of omission of a specific date, which masks that date to aid the de-identification process. Here, we merge the specific date into a period where multiple other modification dates are present.

# Merge the lone October 2017 record into the December 2017 cohort
loan_data_new <- loan_data_new %>% 
  mutate(fmod_dte_new = ifelse(year(fmod_dte) == 2017 & month(fmod_dte) == 10,
                               ymd("2017-12-01"),
                               fmod_dte)) %>% select(-fmod_dte)

loan_data_new$fmod_dte_new <- as.Date(loan_data_new$fmod_dte_new, origin = "1970-01-01") # Important to return values to date format after mutate


pl10 <- ggplot(data = loan_data_new %>% filter(year(fmod_dte_new)==2017), aes(x=month(fmod_dte_new,label=TRUE))) + geom_bar(fill='#007f00',color='black') + 
  labs(y='Number of borrowers',x='Month of loan modification date') + theme_classic() +
  ggtitle("Loan modification months in 2017 after date suppression") + theme(plot.title = element_text(hjust=0.5)) +
  scale_y_continuous(labels = as.character(borrowers),breaks = borrowers) 
print(pl10)

Figure 2.19: Loan modification month after date suppression

As we can observe from figure 2.19, the single individual with a loan modification date in October 2017 has been merged into the December 2017 cohort. This prevents re-identification of the person from the dataset.
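A quick way to confirm the suppression worked is to re-count the month-year cohorts and check that none contains exactly one borrower. A sketch of such a sanity check (column names follow the code above; this check is not part of the original pipeline):

```r
library(dplyr)
library(lubridate)

# Any month-year cohort with exactly one borrower would still allow
# re-identification; after suppression this should contain zero rows
singleton_months <- loan_data_new %>%
  filter(!is.na(fmod_dte_new)) %>%
  count(yr = year(fmod_dte_new), mth = month(fmod_dte_new)) %>%
  filter(n == 1)

nrow(singleton_months)
```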

2.3.6 🔩 Rounding data and top coding

Using this technique, certain values are rounded up or down to the nearest common value so that no unique values remain that could be used for re-identification. In addition, we can utilise top coding to aggregate extreme values, which further aids de-identification.
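On a toy vector, the two operations can be sketched as follows (the values here are illustrative; the $585,000 cutoff mirrors the one derived from the IQR rule below):

```r
# Toy balances; the last one sits far in the right tail
balances <- c(12345, 98760, 254300, 900000)

# Rounding: round(x, -3) snaps each value to the nearest thousand,
# collapsing near-identical balances onto the same value
rounded <- round(balances, -3)

# Top coding: everything above the cutoff is collapsed into one category
cutoff <- 585000
coded  <- ifelse(rounded > cutoff, "> 585000",
                 format(rounded, scientific = FALSE, trim = TRUE))
```

Note that mixing a text label with numbers makes the result a character vector, which is why the report later converts back with `as.numeric()` for plotting.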

2.3.6.1 Account closure balance

As can be observed in figure 2.20, the histogram has a long right tail, indicating that high account closure balances are scattered in small numbers. These outliers will therefore be top coded to reduce the number of unique values in the dataset. Currently, there are 241 unique values; we will now utilise value rounding and top coding to reduce the number of unique account closure balance values.

unique_fce_upb <- loan_data_new %>% count(fce_upb) %>% filter(n==1)

pl11 <- ggplot(data=loan_data_new,aes(x=fce_upb)) + geom_histogram(alpha=0.7,color='red') + theme_classic() +
  labs(x= 'Account closure balance ($)',y= 'Number of individuals') + ggtitle("Account closure balance before rounding and top coding")
print(pl11)

Figure 2.20: Distribution of Account closure balance before rounding data

# Outlier threshold: Q3 + 1.5 * IQR
fce_upb_outlier <- (IQR(loan_data_new$fce_upb, na.rm = TRUE) * 1.5) + quantile(loan_data_new$fce_upb, na.rm = TRUE, 0.75)

# Round balances to the nearest thousand and top code the outliers; the
# "> 585000" label corresponds to the computed threshold above
loan_data_new <- loan_data_new %>% mutate(fce_upb_new = ifelse(fce_upb >= fce_upb_outlier, "> 585000", as.numeric(round(fce_upb, -3))))

pl12 <- ggplot(data=loan_data_new,aes(x=as.numeric(fce_upb_new))) + geom_histogram(alpha=0.7,color='green') + theme_classic() +
  labs(x= 'Account closure balance ($)',y= 'Number of individuals') + ggtitle("Account closure balance after rounding and top coding")
print(pl12)

Figure 2.21: Account closure balance after rounding and top coding

After performing the required operations, the tail no longer appears in the histogram, as can be observed from figure 2.21. The number of unique values in the dataset has now been reduced to 2. Since this is minuscule compared to the entire dataset, we can replace these remaining unique values with null (NA) values to prevent any possible re-identification.

# Account closure balances that are still unique after rounding
unique_list <- loan_data_new %>% 
                        count(fce_upb_new) %>% 
                        filter(n == 1) %>%
                        select(fce_upb_new) 

# Suppress the remaining unique balances by replacing them with NA
loan_data_new <- loan_data_new %>% 
  mutate(fce_upb_final = ifelse(fce_upb_new %in% pull(unique_list, fce_upb_new), NA, as.numeric(fce_upb_new)))

After removing the unique values of account closure balance, no unique account closure values remain, which aids the de-identification process.

2.3.6.2 Original loan term

Original loan terms are well known to the borrower at the time of loan disbursal. As a result, these values may need to be de-identified. The original dataset contains 32 unique values.

In this approach, the original loan term is converted from months to years and rounded to the nearest whole number, so only an approximate value of the original loan term is released. Additionally, original terms of less than 5 years are bottom coded to the category “< 5 years”. As we can observe from table 2.8, there are no unique loan terms (in years).

# Convert the original term from months to years (rounded to the nearest
# whole year) and bottom code terms shorter than five years
loan_data_new <- loan_data_new %>% mutate(orig_trm_yr = ifelse(round(orig_trm/12) < 5,
                                                               "< 5 years", round(orig_trm/12)))

orig_trm_yr_table <- loan_data_new %>% count(orig_trm_yr) %>% arrange(n) %>% head(5)
orig_trm_yr_table <- orig_trm_yr_table %>% rename("Original loan term (in years)" = "orig_trm_yr",
                                                  "Number of borrowers" = "n")

orig_trm_yr_table %>% kable(caption = 'Least original loan terms (in years)',booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("bordered","hover")) %>%
  row_spec(0,background="rgb(172,175,145)",color='black',font_size = 18)
Table 2.8: Least original loan terms (in years)
| Original loan term (in years) | Number of borrowers |
|-------------------------------|---------------------|
| < 5 years | 2 |
| 6 | 6 |
| 5 | 9 |
| 7 | 52 |
| 21 | 88 |

2.3.7 💡 Final data preparation

# Fields to drop before release: raw identifiers and the original
# (pre-de-identification) versions of the transformed columns
omitted_fields <- c("last_upb", "no_depend", "cus_age", "orig_amt",
                    "cscore_b", "cscore_c", "num_bo", "dti",
                    "income", "frst_dte", "fce_upb",
                    "orig_trm", "repch_flag", "orig_rt", "last_rt",
                    "lpi_dte", "loan_age", "aqsn_dte")

loan_data_new <- loan_data_new %>% select(-all_of(omitted_fields))

write_csv(loan_data_new,here::here("data/release-data-Baruah-Arindom.csv"))

3 👓 Conclusion

The primary aim of any open data release is to make the released dataset suitable for public use while preventing unethical and malicious usage. Hence, it is critical to utilise appropriate de-identification strategies so that the details of an individual cannot be recognised. Appropriate de-identification methods were applied to the raw dataset, and all unique details were de-identified.

4 🕸️ Data Source and Usage Policy

The dataset is open data obtained from Fannie Mae. Fannie Mae allows the reuse of its data under certain terms and conditions, such as for academic and research purposes, subject to its “Royalty-free – Internal Use Only Terms and Conditions”. More details on the terms and conditions for usage of the data can be found here.

5 💻 Packages and software used

  1. RStudio : RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.

  2. Tidyverse : Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.

  3. ggplot2 : H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

  4. lubridate : Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL https://www.jstatsoft.org/v40/i03/.

  5. ggrepel : Slowikowski K (2023). ggrepel: Automatically Position Non-Overlapping Text Labels with ‘ggplot2’. R package version 0.9.3, https://CRAN.R-project.org/package=ggrepel.

  6. scales : Wickham H, Seidel D (2022). scales: Scale Functions for Visualization. R package version 1.2.1, https://CRAN.R-project.org/package=scales.

  7. here : Müller K (2020). here: A Simpler Way to Find Your Files. R package version 1.0.1, https://CRAN.R-project.org/package=here.

  8. kableExtra : Zhu H (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4, https://CRAN.R-project.org/package=kableExtra.

💎 Reference

Floridi, Luciano, and Mariarosaria Taddeo. 2016. “What Is Data Ethics?” Philosophical Transactions of the Royal Society A 374 (November). https://doi.org/10.1098/rsta.2016.0112.
Kuc-Czarnecka, Marta, and Magdalena Olczyk. 2020. “How Ethics Combine with Big Data: A Bibliometric Analysis.” Humanities and Social Sciences Communications 7 (1): 1–9.